Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mlx_whisper: add support for audio input from stdin #1012

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

anthonywu
Copy link

@anthonywu anthonywu commented Oct 3, 2024

Problem

I wanted to pipe an audio file to mlx_whisper, but found it only accepted file paths. This PR will allow mlx_whisper to accept stdin and pass it to ffmpeg accordingly then allow the rest of the workflow to go on as usual.

Changes

  1. load_audio helper adjusts ffmpeg flags based on file path vs. stdin mode
  2. CLI parser will gracefully omit the otherwise-required positional audio arg if stdin is determined to be active
  3. optionally, --input-name arg is supported to help users name the otherwise anonymous stdin content (cannot guess from file path)
  4. added tests in macOS standard zsh file to drive and test the changes from the CLI

Process

  1. ran black and pre-commit on changes prior to PR
  2. python test.py shows 4 errors, some regarding floating point comparisons. Looks very far away from my change, may be known issues.

@awni
Copy link
Member

awni commented Oct 3, 2024

Thanks for the addition. What do you think about a couple modifications:

  • For piping from stdin use - as in mlx_whisper - . That is what we do in MLX LM so it is more consistent.
  • The argument --input-name is confusing to me. I understand it now but I think it will in general be confusing. It might be more clear to allow an optional --output-names argument with appropriate defaults (basename when available or output when not).

@anthonywu
Copy link
Author

anthonywu commented Oct 4, 2024

  • For piping from stdin use - as in mlx_whisper - . That is what we do in MLX LM so it is more consistent.

Done. I agree self consistency between related projects is worth more than aesthetic preferences. This does have the nice effect of eliminating test cases.

The only tradeoff is users who reflexively think they can pipe anything into any tool's bare name will have to read the docs.

  • The argument --input-name is confusing to me. I understand it now but I think it will in general be confusing. It might be more clear to allow an optional --output-names argument with appropriate defaults (basename when available or output when not).

I've come around to --output-name and have proposed a "template" solution that preserves existing behavior, but also leaves room for future improvements such as fancy rename strategies based on transcribed audio content, or allows for power users to produce diff variations of output names when they use the same audio_path but use diff parameters.

Comment on lines +31 to +33

parser.add_argument("audio", nargs="+", help="Audio file(s) to transcribe")

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my black==24.4.2 insists on re-formatting this line which didn't change

parser.add_argument(
"--output-name",
type=str,
default="{basename}",
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this defaults to pre-existing behavior, I would consider the default an implementation detail not for the user to be concerned about. If this is too much, we can default to None and handle the None inside the implementation.

Comment on lines +47 to +59
if isinstance(audio_obj, (str, pathlib.Path)):
basename = pathlib.Path(audio_obj).stem
else:
# mx.array, np.ndarray, etc
basename = "content"

output_basename = self.output_name_template.format(basename=basename)

output_path = (pathlib.Path(self.output_dir) / output_basename).with_suffix(
f".{self.extension}"
)

with open(output_path, "w", encoding="utf-8") as f:
with output_path.open("wt", encoding="utf-8") as f:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored using more "modern" Pathlib, see https://docs.astral.sh/ruff/rules/builtin-open/

Comment on lines +27 to +30
--output-name "{basename}_mwpl_${test_val}" \
--output-dir "$TEST_OUTPUT_DIR" \
--output-format srt \
--max-words-per-line $test_val \
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this I think can be useful for research, run a variation of transcriptions by adjusting knobs, then name those outputs appropriately

@@ -0,0 +1,69 @@
#!/bin/zsh -e
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bare bones shell test harness to add coverage for now, without involving this PR in choosing a higher level shell test runner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants